Abstract 

A common assumption of stereo vision researchers is that the goal of stereo is to compute explicit 3D 
information about a scene, to support activities such as navigation, hand-eye coordination and object 
recognition. This paper suggests reconsidering what is required of a stereo algorithm, in light of the needs 
of the task that uses its output. We show that very accurate camera calibration is needed to reconstruct 
accurate 3D distances, and argue that often it may be difficult to attain and maintain such accuracy. We 
further argue that for tasks such as object recognition, separating object from background is of central 
importance. We suggest that stereo can help with this task, without explicitly computing 3D information. 
We provide a demonstration of a stereo algorithm that supports separating figure from ground through 
attentive fixation on key features. 
1 Introduction 

The title of this article is, of course, deliberately provoca- 
tive, in part to capture the reader's attention, but in 
part also to make a point. A common assumption of 
researchers working in stereo vision is that the goal of 
stereo is to compute explicit 3D information about a 
scene, in order to support activities such as navigation, 
hand-eye coordination and object recognition. While 
there are applications in which such information can be 
accurately computed, these domains require very accu- 
rate camera calibration information. We suggest that 
in many applications, it may be difficult to attain and 
maintain such accurate information, and hence we sug- 
gest that it may be worthwhile to reconsider what is re- 
quired of a stereo algorithm, in light of the needs of the 
task that uses stereo's output. In particular, we examine 
the role of stereo in object recognition, arguing that it 
may be more effective as a means of separating objects 
from background, than as a provider of 3D information 
to match with object models. To support this argument, 
we provide a demonstration of a stereo algorithm that 
separates figure from ground through attentive fixation 
on key features, without explicitly computing actual 3D 
information. 

2 Some Stereo Puzzles 

It has been common in recent years within the computer 
vision community to consider the stereo vision problem 
as consisting of three key steps [23], [27]: 

• Identify a particular point in one image (say the 
left). 

• Find the point in the other (say right) image that 
is a projection of the same scene point as observed 
in the first image. 

• Measure the disparity (or difference in projection) 
between the left and right image points. Use knowl- 
edge of the relative orientation of the two camera 
systems, plus the disparity, to determine the actual 
distance to the imaged scene point. 

These steps are repeated for a large number of points, 
leading to a 3D reconstruction of the scene, at those 
points. 

There are many variations on this theme, including 
whether to use distinctive features such as edges or cor- 
ners as the points to match, or to simply use local 
patches of brightness values, what constraints to apply to 
the search for corresponding matches (e.g. epipolar lines, 
similar contrast, similar orientation, etc.), and whether 
to restrict the relative orientation of the cameras (e.g. 
to parallel optic axes). Nonetheless, it has been com- 
monly assumed for some time that the hard part of the 
problem is solving for the correspondence between left 
and right image features. Once one knows which points 
match, it has been assumed that measuring the dispar- 
ity is trivial, and that solving for the distance simply 
requires using the geometry of the cameras to invert a 
simple trigonometric projection. 

Figure 1: Cornsweet illusion in depth. 

This sounds fine, but let's consider some puzzles about 
this approach. The first puzzle is a perceptual one, 
illustrated in Figure 1. This illusion is a depth variant on the 
standard Cornsweet illusion in brightness, and is due to 
Anstis et al. [2] (see also [37]). It consists of a physical 
object with two coplanar regions separated by a sharp 
discontinuity, where the regions immediately to the sides 
of the discontinuity are smoothly curved. These surfaces 
are textured with random dot paint, to make them visi- 
ble to the viewer. Subjects are then asked to determine 
whether the two planar regions are coplanar, or sepa- 
rated in depth, and if it is the latter, which surface is 
closer and by how much. Although physically the two 
surfaces are in fact coplanar, subjects consistently see 
one of the two surfaces as closer (the left side in the case 
of Figure 1). The reported error is .5 cm and is consistent 
for three different view distances: 72, 145 and 290cm. 

This is clearly surprising if one believes that the above 
description of the stereo process holds for biological as 
well as machine solutions. In particular, if the human 
system maintains a representation of reconstructed dis- 
tance, and if that representation is accessible to queries, 
then it is difficult to see how human observers could con- 
sistently make such a mistake. 

Additional stereo puzzles are provided in [40], which 
the authors use to argue that depth is not computed 
directly in humans, but is reconstructed from non-zero 
second differences in depth. As a consequence, they 
demonstrate that human stereo vision is blind to con- 
stant gradients of depth. Similar observations on the 
role of disparity gradients in reconstructing depth are 
given by [44]. 

It need not be the case that machine stereo systems 
make the same "mistakes" as human observers, but the 
existence of such an illusion for humans raises an in- 
teresting question about the basic assumptions of ap- 
proaches that reconstruct distance. 

Consider a second puzzle about the approach of 
matching features, then using trigonometry to convert 
into depth. As noted, for years stereo researchers have 
assumed that the correspondence problem was the hard 
part of the task. Once correct correspondences were 
found, the reconstruction was a simple matter of geometry. 
This is true in principle, but it relies on finding 
the intrinsic parameters of the camera systems and the 
extrinsic parameters relating the orientation of the two 
cameras. While solutions exist for finding these param- 
eters (e.g. [41]), such solutions appear to be numerically 
unstable [45, 43]. If one does not perform very care- 
ful calibration of the camera platform, the result will be 
very noisy reconstructed distances. 

Of course, there are circumstances in which careful 
calibration can be performed, and in these cases, ex- 
tremely accurate reconstructions are possible. A good 
example of this is automated cartographic reconstruc- 
tion from satellite imagery, where commercial systems 
can provide maps with accuracy on the order of a few 
meters, from satellite photography [19]. On the other 
hand, if the cameras are mounted on a mobile robot that 
is perturbed as it moves through the environment, then 
it may be more difficult to attain and maintain careful 
calibration. Thus, we see that there are some sugges- 
tions that human observers do not reconstruct depth, 
and some suggestions that one needs very careful cali- 
bration (which is often hard to guarantee) in order to do 
this. We will explore the calibration sensitivity issue in 
section 3. 

Given this puzzle, it is worth stepping back to ask 
what one needs from the output of a stereo algorithm. 
Aside from specialized tasks such as cartography, the two 
standard general application areas are navigation and 
recognition. Interestingly, Faugeras [8] (see also [39]) has 
recently argued that one can construct and maintain a 
representation of the scene structure around a moving 
robot, without a need for careful calibration. Moreover, 
the solution involves using relative coordinate systems 
to represent the scene, so that there is no metrical re- 
construction of the scene. 

What about object recognition? We have found it 
convenient to separate the recognition problem into three 
pieces [11]: 

• Selection: Extract subsets of the data features 
likely to have come from a single object. 

• Indexing: Look up those object models that could 
have given rise to one such selected subset. 

• Correspondence: Determine if there is a way of 
matching model features to data features that is 
consistent with a legal transformation of the model 
into the data. 

We have argued [11] that for many approaches to 
recognition, the first stage is the crucial one. In many 
cases, it reduces the expected complexity of recognition 
from exponential to low-order polynomial, and in many 
cases, it is necessary to keep the false positive rates under 
control. If we accept that the hard part of recognition is 
selection, rather than correspondence, then this has an 
interesting implication for stereo. If stereo were mainly 
oriented towards solving the correspondence problem, it 
is natural to expect that it needs to deliver accurate 3D 
data that can be compared to 3D models. But if stereo is 
mainly intended to help with the selection problem, then 
one no longer needs to extract exact 3D reconstructions, 
one simply needs stereo to identify data feature subsets 
that are roughly in the same depth range, or equivalently 
do not have large variations in disparity. We will examine 
a modified stereo algorithm in section 4 that takes 
advantage of this observation. 

If one accepts that stereo is primarily for segmenta- 
tion, not for 3D reconstruction, this leads to the further 
question of whether recognition of 3D objects can be 
done without explicit 3D input data. A number of re- 
cent techniques have shown interesting possibilities along 
these lines; for example, the recent development of the 
linear combinations method [42] suggests that one could 
use stored 2D images of a model to generate an hypoth- 
esized 2D image which can then be compared to the ob- 
served image. Again, one does not need to extract exact 
3D data. It is also intriguing along these lines to observe 
that some physiological data [34, 35] may also support 
the idea of the human system solving 3D recognition 
from purely 2D views. Of course, it is possible to solve 
the recognition problem by matching reconstructed 3D 
stereo data against 3D models [27]. 

To summarize, we consider three main points: 

• the human stereo system may not directly compute 
3D depth, suggesting that humans may not need 
explicit depth; 

• small inaccuracies in measuring camera parameters 
can lead to large errors in computed depth, suggest- 
ing that we may not be able to compute explicit 
depth accurately; 

• the critical part of object recognition is fig- 
ure/ground separation, which may not require ex- 
plicit depth information. 

We will use this to argue that stereo can contribute 
to the efficient solution of the object recognition prob- 
lem, without the need for accurate calibration and with- 
out the need for explicit depth computation. In this 
case, the importance of eye movements or related con- 
trol strategies is increased, causing us to reexamine the 
structure of stereo algorithms. Similar questions have 
been raised by work on actively controlled stereo eye-head 
systems that acquire depth information (for example, 
[1, 5, 6, 7, 9, 20, 30, 38, 33]). 

3 Why Reconstruction is Too Sensitive 

While our first point is based primarily on earlier psy- 
chophysical observations, the second point bears closer 
examination. Let's look in more detail at the problem of 
computing distance from stereo disparity. Suppose our 
two cameras have points of projection located at $\mathbf{b}_l$ and 
$\mathbf{b}_r$, measured in some world coordinate system. Assume 
that the optic axes are $\mathbf{z}_l$ and $\mathbf{z}_r$, and that both cameras 
have the same focal length $f$ (though we could easily 
relax this to have two different focal lengths). 

In this case, we can represent the left image plane by 

$$\{\mathbf{v} \mid \langle \mathbf{v}, \mathbf{z}_l \rangle = d_l\}$$

where $\langle \cdot, \cdot \rangle$ represents an inner (or dot) product. The 
principal point (or image center) is given by 

$$\mathbf{c}_l = \mathbf{b}_l + f\,\mathbf{z}_l$$

where we have chosen to place the image plane in front of 
the projection point, to avoid the inversion of the coordinate 
axes of the image. Since we know that this point lies 
on the image plane, we can deduce the constant offset, 
so that the left image plane is given by 

$$\{\mathbf{v} \mid \langle \mathbf{v} - \mathbf{b}_l, \mathbf{z}_l \rangle = f\}.$$

A similar representation holds for the right image plane. 
Now an arbitrary scene point $\mathbf{p}$ maps, under perspective 
projection, to a point $\mathbf{p}_l$ on the left image plane, 
given by 

$$\mathbf{p}_l = \mathbf{b}_l + \frac{f\,(\mathbf{p} - \mathbf{b}_l)}{\langle \mathbf{p} - \mathbf{b}_l, \mathbf{z}_l \rangle}$$

and for convenience we write this as 

$$\mathbf{p}_l = \mathbf{c}_l + \mathbf{d}_l$$

where $\langle \mathbf{d}_l, \mathbf{z}_l \rangle = 0$. Here $\mathbf{d}_l$ is an offset vector in the 
image plane from the principal point: 

$$\mathbf{d}_l = \frac{f}{\langle \mathbf{p} - \mathbf{b}_l, \mathbf{z}_l \rangle}\;\mathbf{z}_l \times \left((\mathbf{p} - \mathbf{b}_l) \times \mathbf{z}_l\right).$$
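The projection and offset expressions above can be checked numerically. The following is a minimal sketch, not part of the original derivation: the helper names (`project`, `image_offset`) and the numeric values are illustrative assumptions, and the optic axis is assumed to be a unit vector.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def add(u, v):
    return tuple(a + b for a, b in zip(u, v))

def scale(s, u):
    return tuple(s * a for a in u)

def project(p, b_cam, z_cam, f):
    """Image of scene point p: b + f (p - b) / <p - b, z>."""
    a = sub(p, b_cam)
    return add(b_cam, scale(f / dot(a, z_cam), a))

def image_offset(p, b_cam, z_cam, f):
    """Offset from the principal point: f z x ((p - b) x z) / <p - b, z>."""
    a = sub(p, b_cam)
    return scale(f / dot(a, z_cam), cross(z_cam, cross(a, z_cam)))

# Hypothetical values, in metres; z_l is a unit vector (3-4-5 triangle).
f = 0.015
b_l = (-0.05, 0.0, 0.0)
z_l = (0.6, 0.0, 0.8)
p = (0.3, 0.1, 1.2)

p_l = project(p, b_l, z_l, f)
c_l = add(b_l, scale(f, z_l))        # principal point
d_l = image_offset(p, b_l, z_l, f)   # offset within the image plane

print(p_l, add(c_l, d_l))            # the two should agree
print(dot(d_l, z_l))                 # should be (numerically) zero
```

The two printed points agree because $\mathbf{z} \times (\mathbf{a} \times \mathbf{z}) = \mathbf{a} - \langle \mathbf{a}, \mathbf{z} \rangle \mathbf{z}$ for a unit vector $\mathbf{z}$.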



Note that we haven't specified the world coordinate 
system yet, and we can now take advantage of that free- 
dom. In particular, we choose the origin of the world 
coordinate system to be centered between the projection 
points, so that $\mathbf{b}_l = -\mathbf{b}_r = \mathbf{b}$. 

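By subtracting $\mathbf{d}_r$ from $\mathbf{d}_l$, we get the following 
relationship 

$$\langle \mathbf{p} - \mathbf{b}_l, \mathbf{z}_l \rangle\,\mathbf{d}_l - \langle \mathbf{p} - \mathbf{b}_r, \mathbf{z}_r \rangle\,\mathbf{d}_r = f\left[-\mathbf{b}_l + \mathbf{b}_r - \langle \mathbf{p} - \mathbf{b}_l, \mathbf{z}_l \rangle\,\mathbf{z}_l + \langle \mathbf{p} - \mathbf{b}_r, \mathbf{z}_r \rangle\,\mathbf{z}_r\right] \qquad (1)$$

For the special case of the origin centered between the 
projection points, this becomes 

$$\langle \mathbf{p} - \mathbf{b}, \mathbf{z}_l \rangle\,\mathbf{d}_l - \langle \mathbf{p} + \mathbf{b}, \mathbf{z}_r \rangle\,\mathbf{d}_r = f\left[-2\mathbf{b} - \langle \mathbf{p} - \mathbf{b}, \mathbf{z}_l \rangle\,\mathbf{z}_l + \langle \mathbf{p} + \mathbf{b}, \mathbf{z}_r \rangle\,\mathbf{z}_r\right] \qquad (2)$$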



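We can isolate components of $\mathbf{p}$ with respect to each 
of the two optic axes, by taking the dot product of both 
sides of equation 1 or 2 with respect to these unit vectors. 
This gives us two linear equations (assuming that $\mathbf{z}_l \neq \mathbf{z}_r$), 
which we can solve to find these components of $\mathbf{p}$. 
Adding them together yields: 

$$\langle \mathbf{p}, \mathbf{z}_l + \mathbf{z}_r \rangle = \frac{1}{\alpha\beta - f^2}\left[(f^2 + \alpha\beta)\,\langle \mathbf{b}, \mathbf{z}_l - \mathbf{z}_r \rangle + 2f\,\langle \mathbf{b}, \beta\mathbf{z}_l - \alpha\mathbf{z}_r \rangle\right] \qquad (3)$$

where 

$$\alpha = \langle \mathbf{d}_r + f\mathbf{z}_r, \mathbf{z}_l \rangle, \qquad \beta = \langle \mathbf{d}_l + f\mathbf{z}_l, \mathbf{z}_r \rangle.$$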
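As a sanity check on equation 3, the following sketch builds a pair of converging cameras, projects a scene point into both, and compares the two sides of the equation. The function names and the numeric geometry are illustrative assumptions; the left camera is placed at $\mathbf{b}$ and the right at $-\mathbf{b}$, as in the text, and both optic axes are unit vectors.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def lin(s, u, t, v):
    """Linear combination s*u + t*v."""
    return tuple(s * a + t * b for a, b in zip(u, v))

def image_offset(p, b_cam, z_cam, f):
    """Offset vector d from the principal point for scene point p."""
    a = lin(1.0, p, -1.0, b_cam)
    depth = dot(a, z_cam)
    return tuple(f * c / depth for c in cross(z_cam, cross(a, z_cam)))

def sum_of_axis_components(d_l, d_r, z_l, z_r, b, f):
    """Right-hand side of equation 3, i.e. <p, z_l + z_r>."""
    alpha = dot(lin(1.0, d_r, f, z_r), z_l)
    beta = dot(lin(1.0, d_l, f, z_l), z_r)
    term1 = (f * f + alpha * beta) * dot(b, lin(1.0, z_l, -1.0, z_r))
    term2 = 2.0 * f * dot(b, lin(beta, z_l, -alpha, z_r))
    return (term1 + term2) / (alpha * beta - f * f)

# Hypothetical geometry (metres); axes are unit vectors and not parallel.
f = 0.015
b = (-0.05, 0.01, 0.0)           # left camera at b, right camera at -b
b_r = tuple(-c for c in b)
z_l = (0.6, 0.0, 0.8)            # 3-4-5 triangle
z_r = (-0.28, 0.0, 0.96)         # 7-24-25 triangle
p = (0.2, 0.1, 1.5)

d_l = image_offset(p, b, z_l, f)
d_r = image_offset(p, b_r, z_r, f)

lhs = dot(p, lin(1.0, z_l, 1.0, z_r))
rhs = sum_of_axis_components(d_l, d_r, z_l, z_r, b, f)
print(lhs, rhs)                  # should agree to rounding error
```

The denominator $\alpha\beta - f^2$ vanishes for parallel axes, which is why the derivation requires $\mathbf{z}_l \neq \mathbf{z}_r$.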



To explore how this computation of depth from stereo 
measurements depends on the accuracy of the calibrated 
parameters and the disparity measurements, we consider 
the symmetric case of: 

$$\mathbf{b}_l = -b\,\hat{\mathbf{x}}, \quad \mathbf{b}_r = b\,\hat{\mathbf{x}}, \quad \mathbf{z}_l = \cos\gamma\,\hat{\mathbf{z}} + \sin\gamma\,\hat{\mathbf{x}}, \quad \mathbf{z}_r = \cos\gamma\,\hat{\mathbf{z}} - \sin\gamma\,\hat{\mathbf{x}}$$

where $\hat{\mathbf{x}}$ is chosen as the direction of the vector connecting 
the two centers of projection, and where the two cameras 
make a symmetric (though opposite signed) gaze 
angle $\gamma$ with the $\hat{\mathbf{z}}$ axis, and where the offset of each 
camera from the origin is the same. In this case, substitution 
and manipulation leads to 



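$$\langle \mathbf{p}, \hat{\mathbf{z}} \rangle \cos\gamma = \frac{2b\,(f\cos^2\gamma + d_r\sin\gamma)\,(f\cos^2\gamma - d_l\sin\gamma)}{2\sin\gamma\,(f^2\cos^2\gamma + d_r d_l) - f\,(\cos^2\gamma - \sin^2\gamma)\,(d_r - d_l)} \qquad (4)$$

where we have let 

$$d_r = \langle \mathbf{d}_r, \hat{\mathbf{x}} \rangle, \qquad d_l = \langle \mathbf{d}_l, \hat{\mathbf{x}} \rangle.$$

Note that in the special case of parallel optic axes ($\gamma = 0$), 
this reduces to 

$$\langle \mathbf{p}, \hat{\mathbf{z}} \rangle = \frac{2bf}{d_l - d_r}$$

which is exactly what one would expect, since $d_l - d_r$ is 
simply the disparity at this point. 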

For convenience, call $Z = \langle \mathbf{p}, \hat{\mathbf{z}} \rangle$. This equation tells 
us how to compute the depth $Z$, given measurements 
for the camera parameters $f$, $b$, $\gamma$ and the two principal 
points $\mathbf{c}_l, \mathbf{c}_r$ as well as the individual measurements of 
displacement $\mathbf{d}_l, \mathbf{d}_r$ (or equivalently $d_l$ and $d_r$). 
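The depth computation for the symmetric verging geometry can be exercised on a synthetic point. This is a sketch under stated assumptions rather than the authors' code: the left camera sits at $-b$ on the x axis with its optic axis turned by $+\gamma$ towards $+x$, the right camera sits at $+b$ with its axis turned by $-\gamma$, the scalar offsets are the x components of the image-plane offset vectors, and the point lies in the x-z plane. All names and numbers are illustrative.

```python
import math

def scalar_offset(point_x, point_z, cam_x, axis_angle, f):
    """x component of the image-plane offset for a point in the x-z plane.

    axis_angle is the direction of the optic axis, measured from the
    z axis towards +x.
    """
    ray_angle = math.atan2(point_x - cam_x, point_z)
    return f * math.tan(ray_angle - axis_angle) * math.cos(axis_angle)

def depth_verging(d_l, d_r, b, f, gamma):
    """Depth Z for symmetric gaze angle gamma (equation 4 solved for Z)."""
    c2 = math.cos(gamma) ** 2
    s = math.sin(gamma)
    num = 2.0 * b * (f * c2 + d_r * s) * (f * c2 - d_l * s)
    den = (2.0 * s * (f * f * c2 + d_r * d_l)
           - f * math.cos(2.0 * gamma) * (d_r - d_l))
    return num / (den * math.cos(gamma))

def depth_parallel(d_l, d_r, b, f):
    """Depth Z for parallel optic axes: 2 b f / disparity."""
    return 2.0 * b * f / (d_l - d_r)

# Hypothetical setup: 10 cm interocular separation, 15 mm focal length.
b, f, gamma = 0.05, 0.015, 0.04
X, Z = 0.12, 0.9

z_verging = depth_verging(scalar_offset(X, Z, -b, +gamma, f),
                          scalar_offset(X, Z, +b, -gamma, f),
                          b, f, gamma)
z_parallel = depth_parallel(scalar_offset(X, Z, -b, 0.0, f),
                            scalar_offset(X, Z, +b, 0.0, f),
                            b, f)
print(z_verging, z_parallel)     # both should recover Z
```

With exact parameters both forms recover the true depth; the sections that follow concern what happens when the parameters are only approximately known.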

The question we want to consider is how accurately 
do we need to know these parameters? There has been 
some previous analysis of stereo error in the literature, 
primarily focused on the effects of pixel quantization 
[43, 28, 25], although some analysis of the effects of cam- 
era parameters has also been done [45, 44]. Here we are 
primarily interested in the effects of the camera param- 
eters. 

For sake of simplicity, we will assume that $\gamma$ is small. 
For example, if the cameras are fixated at a target 1 
meter removed, with an interocular separation of 10 cm, 
then $\gamma \approx .05$ radians, or if the fixation target is .5 meters 
off, then $\gamma \approx .1$ radians. In the second case, the small 
angle approximation will lead to an error in $\cos\gamma$ of at 
most .005 and an error in $\sin\gamma$ of at most .0002. Using 
the small angle approximation leads to 

$$Z \approx \frac{2b\left[f^2 + \gamma f\,(d_r - d_l)\right]}{2\gamma\,(f^2 + d_r d_l) - f\,(d_r - d_l)} \qquad (5)$$
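The numbers quoted for the small-angle approximation can be reproduced directly, and the approximate depth can be compared with the unapproximated one. The sketch below assumes the same symmetric geometry and sign conventions as the derivation (left camera at $-b$ turned by $+\gamma$, right camera at $+b$ turned by $-\gamma$); function names and the test point are illustrative.

```python
import math

def scalar_offset(point_x, point_z, cam_x, axis_angle, f):
    """x component of the image-plane offset for a point in the x-z plane."""
    ray_angle = math.atan2(point_x - cam_x, point_z)
    return f * math.tan(ray_angle - axis_angle) * math.cos(axis_angle)

def depth_exact(d_l, d_r, b, f, gamma):
    """Depth from the full trigonometric expression (equation 4)."""
    c2, s = math.cos(gamma) ** 2, math.sin(gamma)
    num = 2.0 * b * (f * c2 + d_r * s) * (f * c2 - d_l * s)
    den = (2.0 * s * (f * f * c2 + d_r * d_l)
           - f * math.cos(2.0 * gamma) * (d_r - d_l))
    return num / (den * math.cos(gamma))

def depth_small_angle(d_l, d_r, b, f, gamma):
    """Depth from the small-angle form (equation 5)."""
    num = 2.0 * b * (f * f + gamma * f * (d_r - d_l))
    den = 2.0 * gamma * (f * f + d_r * d_l) - f * (d_r - d_l)
    return num / den

# Gaze angles for a 10 cm interocular separation (half-baseline 5 cm).
gamma_1m = math.atan2(0.05, 1.0)      # fixation at 1 m
gamma_half_m = math.atan2(0.05, 0.5)  # fixation at 0.5 m

# Worst-case approximation errors at gamma = 0.1 radians.
cos_err = abs(1.0 - math.cos(0.1))
sin_err = abs(0.1 - math.sin(0.1))

# Compare both depth formulas for a point at 0.5 m with gamma = 0.05.
b, f, gamma = 0.05, 1.0, 0.05         # f = 1: offsets in focal-length units
d_l = scalar_offset(0.0, 0.5, -b, +gamma, f)
d_r = scalar_offset(0.0, 0.5, +b, -gamma, f)
z_exact = depth_exact(d_l, d_r, b, f, gamma)
z_approx = depth_small_angle(d_l, d_r, b, f, gamma)
rel_gap = abs(z_approx - z_exact) / z_exact

print(gamma_1m, gamma_half_m, cos_err, sin_err, rel_gap)
```

For this point the two depth estimates differ by a small fraction of a percent, so the approximation itself is not what limits accuracy in the analysis that follows.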

If we rewrite this, isolating depth in terms of interocular 
units ($2b$), and image offsets in terms of focal length (or 
equivalently in terms of angular arc), we get: 

$$\frac{Z}{2b} \approx \frac{1 + \gamma\left(\dfrac{d_r}{f} - \dfrac{d_l}{f}\right)}{2\gamma\left(1 + \dfrac{d_r}{f}\,\dfrac{d_l}{f}\right) - \left(\dfrac{d_r}{f} - \dfrac{d_l}{f}\right)} \qquad (6)$$

In some cases it is more convenient to consider this expression 
in terms of relative units, that is representing 
depth in terms of interocular spacing, by using 

$$\frac{Z}{2b}$$

and to use disparities as angular arcs by using 

$$\frac{d_l}{f}, \qquad \frac{d_r}{f}.$$
By taking partial derivatives of this equation with respect 
to each of the parameters of interest (which we 
treat as independent of one another), we arrive at the 
following expressions for the relative change in computed 
depth as a function of the relative error in measuring the 
parameters: 
If we use standard viewing geometries (i.e. focal length 
much larger than individual pixel size, $\gamma$ small), we can 
approximate these expressions as follows: 
We note that related error expressions were obtained 
in [43], although the focus there was on the effects of er- 
rors in the matching of image features and the quantiza- 
tion of image pixels on the accuracy of recovered depth. 
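The sensitivity of depth to each parameter can also be estimated numerically. The sketch below is an illustration, not the closed-form expressions of the analysis: it differentiates the small-angle depth expression by central differences at a fixated point with zero image offsets at $Z/2b = 10$. The choice of units ($2b = 1$, $f = 1$), the step size, and all names are assumptions.

```python
import math

def depth_small_angle(b, f, gamma, d_l, d_r):
    """Small-angle depth expression (equation 5)."""
    num = 2.0 * b * (f * f + gamma * f * (d_r - d_l))
    den = 2.0 * gamma * (f * f + d_r * d_l) - f * (d_r - d_l)
    return num / den

def log_sensitivity(params, name, step=1e-6):
    """Central-difference estimate of d(ln Z) / d(parameter)."""
    hi = dict(params)
    lo = dict(params)
    hi[name] += step
    lo[name] -= step
    z = depth_small_angle(**params)
    return (depth_small_angle(**hi) - depth_small_angle(**lo)) / (2.0 * step * z)

# Fixated point: zero offsets, gamma = 1 / (2 * 10) so that Z / 2b = 10.
params = {"b": 0.5, "f": 1.0, "gamma": 0.05, "d_l": 0.0, "d_r": 0.0}

# First-order relative depth changes for some representative errors.
baseline_1pct = log_sensitivity(params, "b") * 0.01 * params["b"]
focal_1pct = log_sensitivity(params, "f") * 0.01 * params["f"]
offset_10px = log_sensitivity(params, "d_l") * 10 * 0.001  # .001 rad pixels
gaze_1deg = log_sensitivity(params, "gamma") * math.radians(1.0)

print(baseline_1pct, focal_1pct, offset_10px, gaze_1deg)
```

At this operating point a 1% baseline error gives a 1% depth error, a 10 pixel offset error gives roughly 10%, and a 1 degree gaze error gives roughly 35% to first order, while the focal length has no first-order effect because a fixated point has zero offsets.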

Our concern is how uncertainty in measuring the cam- 
era parameters impacts the computed depth. Ideally, we 
would like a linear relationship, so that, for example, a 
1 percent error in computing a parameter would result 
in at most a 1 percent error in depth. 

To explore this, we consider two cases: a camera sys- 
tem with 15mm focal length and .015mm pixels so that 
a pixel subtends an angular arc of .001 radians; and the 
human visual system, where the fovea has a receptor 
packing subtending approximately .00014 radians. 

By equation 8, relative errors in computed depth due 
to mismeasurement of the baseline separation are gen- 
erally quite small. For example, a 1% relative error in 
measuring the baseline will result in a 1% relative error 
in the computed distance. 



Equations 9 and 10 are essentially the same. They 
show a non-linear effect, in that the relative error in com- 
puting depth is a function both of the relative error in 
computing the position of each image point with respect 
to the global coordinate frame, and more importantly is 
a function of the distance of the object from the viewer, 
in units of interocular separation (2b). Thus, the relative 
error will get much worse for more distant objects. If we 
let the pixel error in measuring position be k, then using 
a standard pixel size and focal length, the relative error 
in depth is 

$$\frac{k}{10^3}\,\frac{Z}{2b}$$

for our camera system. To see how large this can get, 
we need to understand what can contribute to k. Effects 
include: 

• image based localization errors 

• image based matching errors 

• registration errors between the image and the world 
coordinates due to: 

— principal points 

— image orientation 

Uncertainty and smoothing effects in the edge detec- 
tor will affect the first source of error, but typically will 
only cause errors on the order of a few pixels. Since 
matching errors by definition must lead to incorrect 
depth reconstructions, we ignore them in our analysis. 
The second major source of error comes from convert- 
ing the image pixel measurements to world coordinates, 
and here there are two main sources. One is that all of 
our disparity measurements in the analysis above were 
based on the displacement of features from the principal 
points. This requires that we measure those principal 
points accurately [21], and this is particularly important 
since in many cameras, the principal point can often be 
tens of pixels away from the center of the sensor array. 
For example, the CCD cameras in use in one of our stereo 
setups have principal points displaced from the image ar- 
ray center by 30 pixels in x and 1 pixel in y for the left 
camera and 18 pixels in x and 3 pixels in y for the right 
camera. Methods in the literature for locating the prin- 
cipal points [21] are reported to have residual errors of 
at most 6 pixels. 

Finally, we need to know the orientation of the camera 
rasters with respect to the world axes. Even if we ignore 
the effects of gaze angle, rotation about the optic axis 
(cyclotorsion) can result in an error in the disparity offset 
with respect to the interocular baseline. Since this error 
goes with the cosine of the rotation, we expect the effects 
of such error to be small. 

If we have found the principal points and the orien- 
tation of the cameras with respect to world coordinates 
accurately, then k will typically be on the order of a few 
pixels. If we have not, k can easily be on the order of 
tens of pixels. To see the effect of this on reconstructed 
depth, Figure 2 shows plots of the percentage relative 
error in computing depth, as a function of the distance 
to the object (measured in units of interocular separation), 
for the case of k = 1 and k = 10. For an object 
1 meter away from our standard camera setup, k = 10 
leads to 10% errors in computed depth. For the human 
system, these errors are reduced by a factor of 10. A 
second way of seeing this is to ask what is the accuracy 
on pixel location needed to keep the relative depth error 
less than 1%, as a function of the distance to the object. 
This is shown in Figure 3. 

Figure 2: Vertical axis is the percentage error in computing 
depth, horizontal axis is the distance to the object (in 
units of interocular separation). Top graph is for errors 
in localizing image features of 10 pixels, bottom graph 
is for 1 pixel errors. 
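The pixel-error relation is simple enough to tabulate. The sketch below assumes the relative depth error is the pixel error times the angular size of a pixel times the distance in interocular units, with the .001 radian pixel of the camera example as the default; function names are illustrative.

```python
def relative_depth_error(k_pixels, z_over_2b, pixel_arc=0.001):
    """Relative depth error for a k-pixel localization error."""
    return k_pixels * pixel_arc * z_over_2b

def max_pixel_error(z_over_2b, tolerance=0.01, pixel_arc=0.001):
    """Largest pixel error that keeps the relative depth error within tolerance."""
    return tolerance / (pixel_arc * z_over_2b)

# Distance in interocular units, errors for k = 1 and k = 10,
# and the pixel accuracy needed for a 1% depth error.
for z in (5, 10, 20):
    print(z,
          relative_depth_error(1, z),
          relative_depth_error(10, z),
          max_pixel_error(z))
```

At $Z/2b = 10$ this gives a 10% depth error for $k = 10$, and shows that localization must be good to about one pixel to keep the depth error under 1%.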

By equation 11, a 1 percent error in estimating $f$, with 
disparities on the order of 10 pixels, will still only lead 
to 1 percent errors in relative depth for nearby objects 
($Z/2b \approx 10$), which is small. Note that as the disparities 
get larger, the error increases. This has the interesting 
implication that if the object of interest is roughly fix- 
ated (i.e. the two optic axes intersect at or near the 
object) then disparities for features on the objects will 
be small, and the depth error will be small, while objects 
at larger disparities will have larger errors. Note that a 
similar observation has been made by Olson [31] who 
shows that much of the sensitivity of depth reconstruc- 
tion to camera parameters can be isolated in the compu- 
tation of the depth of the fixation point, while relative 
depth of other points with respect to this fixation can be 
computed fairly accurately. 

All of this analysis is encouraging. Consider equation 
12, however. Here, a 1 degree error in estimating the 
gaze angle will lead to 34 percent relative depth errors 
for nearby objects ($Z/2b \approx 10$), and even a .5 degree gaze 
angle error will lead to 17 percent relative depth errors. 
This is graphed in more detail in Figure 4. Similarly, in 
Figure 5, we plot the accuracy in gaze angle needed to 
keep the relative depth error at most 1%, as a function 
of distance to the object. 

Figure 3: Vertical axis is the accuracy in pixel location 
needed so that the relative error in depth is less than 1%, 
horizontal axis is the distance to the object (in units of 
interocular separation). 

We note that errors due to gaze angle calibration 
could be a real problem. It is interesting to note that 
the human system appears able to measure gaze angle 
only up to an accuracy of roughly 1 degree [16] (page 
67). 
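The gaze-angle figures quoted above are consistent with a first-order reading of the small-angle depth expression at the fixation point. The following worked step is a sketch under that assumption (zero image offsets, $d_l = d_r = 0$), and is not necessarily the exact form of equation 12:

```latex
Z \approx \frac{2b f^2}{2\gamma f^2} = \frac{b}{\gamma}
\quad\Longrightarrow\quad
\frac{\Delta Z}{Z} \approx -\frac{\Delta\gamma}{\gamma}
= -2\,\frac{Z}{2b}\,\Delta\gamma .
```

With $Z/2b = 10$ and $\Delta\gamma = 1^\circ \approx 0.01745$ radians, the magnitude is $2 \times 10 \times 0.01745 \approx 0.349$, in line with the 34 percent quoted; halving the gaze error gives about 17 percent. Conversely, holding the relative depth error to 1% at this distance requires $\Delta\gamma \le 0.01 / 20 = 0.0005$ radians, or about $0.03$ degrees.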

In short, we need to be certain that we have estimated 
the principal points accurately, and that we have very 
accurate measurements of the gaze angles of the cam- 
eras. If we cannot do so, then we will suffer distortion in 
our computed depth. More importantly, that distortion 
varies with actual depth, so the effect is non-linear. If we 
are trying to recognize an object whose extent in depth 
is small relative to the distance to its centroid, then the 
effect of this noise sensitivity is reduced. This is because 
the effect of the error will be systematic, and in the 
case of small relative depth, this uncertainty basically 
becomes a constant scale factor on the computed depth. 
On the other hand, however, if the object has notice- 
able relative extent in depth (even on the order of a few 
percent), then the uncertainty in computing depth will 
skew the results, causing difficulties for most recognition 
methods that compare computed 3D structure against 
stored models. Thus, the sensitivity may cause serious 
problems for recognition methods, both due to the large 
errors in depth and due to the distortions with varying 
depth. 

Figure 4: Vertical axis is the percentage error in computing 
depth, horizontal axis is the distance to the object 
(in units of interocular separation). Graphs are for errors 
in computing the gaze angle of 1, .5 and .25 degrees, 
from top to bottom. 

Figure 5: Vertical axis is the accuracy in gaze angle (in 
degrees) needed so that the relative error in depth is less 
than 1%, horizontal axis is the distance to the object (in 
units of interocular separation). 

4 Another Look at Stereo 

Given that it may be difficult to reliably compute distance, 
and that distance may not be needed to handle 
the two main uses of stereo output, we suggest that 
it is useful to reconsider the performance requirements 
that stereo should satisfy to support tasks such as ob- 
ject recognition. To handle figure/ground separation, a 
stereo algorithm should: 

• be able to detect proximal (in the image) features 
that lie within some range of depth (i.e. find points 
that are near one another in 3D space, even if one 
does not know exactly where in 3D), 

• be able to align matching distinctive features so 
that they are centered in the two images, to ensure 
that nearby parts of the corresponding object are 
visible in both images and can be matched, 

• be able to integrate other visual cues about possible 
trigger features to foveate and fixate. 

First, we should consider whether we can use exist- 
ing stereo algorithms (e.g. [10], [4], [26], [36], [14]) to 
tackle the problem of figure/ground separation. We can 
conveniently separate stereo processing into several com- 
ponents: 

• Choice of features to match: for our discussions, 
we will consider only edge based stereo matching. 

• Constraints on the matching process. 

• Control mechanism used to guide the matching 
process. 

Most current stereo algorithms solve the correspon- 
dence problem as follows: Given any left image edge, 
search the set of right image edges for a unique match. 
The search is usually constrained by the (assumed 
known) epipolar geometry, and by a set of similarity 
constraints (e.g. edges should have similar orientation, 
similar contrast (or intensity variation), and so on). This 
holds both for matching individual edge points (in which 
case additional constraints such as figural continuity may 
also apply) and for extended edge fragments. 

The key question is what constitutes a unique match, 
and this depends on the control mechanism used by the 
algorithm. For example, most of these algorithms at- 
tempt to find matches over a wide range of disparity, 
reflecting the fact that the viewed scenes may have ob- 
jects ranging from close to the viewer (less than 1 meter) 
out to objects at the horizon. This can easily translate 
into disparity ranges on the order of several hundred pix- 
els. The problem is that under these circumstances, it 
may be very difficult to guarantee uniqueness of match, 
especially when one is only considering local attributes 
of features, such as orientation and local contrast. One 
solution is to incorporate local geometric information 
about nearby edges [3], [29]. But an alternative is to 
consider changing the control mechanism. 

The key problem is that previous stereo algorithms 
had as their goal the reconstruction of the scene, and 
hence they were designed to find as many correct 
matches as possible, over a wide range of disparities. 
On the other hand, if all we are interested in is sepa- 
rating out candidate image features that are likely to 
correspond to a single object, and we are willing to allow 
edge features to participate in several such groups, 
then an alternative control method is viable. In particular, 
since we are interested in finding roughly contiguous 
3D regions, it is attractive to envision a control method 
in which one fixates at some target, then searches for 
matching features within some range of disparity about 
that fixation point, collecting all such matching features 
as a candidate object, and continues. 

Such an algorithm is similar in approach to some ear- 
lier stereo methods, notably [23, 27, 3], and it bears 
some similarity to evidence about the human stereo system, 
particularly in the restriction of matching disparities only 
over a narrow range about the fixation point (referred 
to as Panum's limit in the perceptual literature) and the 
role of eye movements in guiding stereo [23, 27, 31]. It 
also clearly relates to work in active stereo head systems 
[1, 5, 6, 7, 9, 20, 30, 38, 33], especially work on using 
saliency of low level cues, or using motion information 
to drive stereo control loops that fixate candidate target 
areas [9, 6, 5, 30, 38, 33]. 

To demonstrate this idea, we have implemented the 
following stereo algorithm (influenced in part by earlier 
algorithms [3], [29]). 

• Process both images to extract intensity edges. For 
convenience, process these edges to extract linear 
segments, using a standard split-and-merge algo- 
rithm. This latter step is mainly for reduction in 
computation and is not central to the demonstra- 
tion. 

• For each linear feature segment, record the position 
of the two endpoints, and the average intensity on 
each side of the feature. Also record the distance 
from each endpoint to other nearby features. 

• Find a distinctive feature in one image that has a 
unique match in the other image, as measured over 
the full range of possible disparities. To begin with, 
we will measure distinctiveness as a combination of 
the length of the feature and the contrast of the 
feature. The idea is that such a feature can serve 
as a focal trigger feature. Of course many other 
cues could serve to focus attention [22]. 

• Rotate both cameras so that the distinct feature 
and its match are both centered in the cameras. 
This is a simple version of a fixation mechanism, in 
which the trigger feature is foveated and fixated in 
both cameras. Note that this will in general cause 
the optic axes to be non-parallel so that epipolar 
lines will no longer lie along horizontal rasters. A 
simpler version just uses a pan and tilt motion of 
the cameras to center the feature in one image, 
while leaving the optic axes parallel. 

• Within a predefined range of disparity $\pm\delta$ 
(Panum's limit) about the zero disparity position 
(due to fixation), search for other features that have 
a unique match. Note that uniqueness here means 
only within this range of disparity. There may be 
other edges outside of this disparity range that sat- 
isfy the matching constraints, but in this case such 
matches are ignored. In our implementation, two 
edges match if their lengths are roughly the same, 
if a significant fraction of each edge has an epipolar 
overlap with the other edge, if the orientation 
is roughly the same, if the average intensity on at 
least one side of the edge is roughly the same, and 
if the arrangement of neighbouring edges at one of 
the endpoints is roughly the same. 

• This set of edges now constitutes a hypothesized 
fragment of a single object. We can save these 
edges, and continue the process, looking for an- 
other unique trigger feature to align the cameras. 
Alternatively, we can pass these edge features on to 
a recognition algorithm, such as Alignment [17, 18]. 
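The control structure of the steps above can be sketched in a few lines. This is a simplified illustration, not the implemented system: edges are reduced to a midpoint and a few attributes, epipolar lines are assumed horizontal (the parallel-axis pan and tilt variant), fixation is simulated by measuring disparity relative to the trigger feature rather than by moving cameras, distinctiveness is taken to be length times contrast, and the `similar` thresholds are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    x: float            # midpoint column (pixels)
    y: float            # midpoint row (pixels)
    length: float       # segment length (pixels)
    orientation: float  # degrees
    contrast: float     # intensity step across the edge, 0..1

def similar(a, b):
    """Hypothetical stand-in for the matching constraints."""
    return (abs(a.y - b.y) <= 2.0
            and abs(a.length - b.length) <= 0.2 * max(a.length, b.length)
            and abs(a.orientation - b.orientation) <= 10.0
            and abs(a.contrast - b.contrast) <= 0.2)

def unique_match(edge, candidates, d_min, d_max):
    """Return the only similar candidate with disparity in [d_min, d_max], else None."""
    hits = [c for c in candidates
            if d_min <= edge.x - c.x <= d_max and similar(edge, c)]
    return hits[0] if len(hits) == 1 else None

def select_figure(left, right, full_range=(-200.0, 200.0), panum=5.0):
    """Fixate on a distinctive trigger edge, then collect nearby-depth matches."""
    ranked = sorted(left, key=lambda e: e.length * e.contrast, reverse=True)
    for trigger in ranked:
        match = unique_match(trigger, right, *full_range)
        if match is not None:
            break
    else:
        return []                      # no usable trigger feature
    fixation = trigger.x - match.x     # disparity that fixation would null
    group = []
    for edge in left:
        m = unique_match(edge, right, fixation - panum, fixation + panum)
        if m is not None:
            group.append((edge, m))
    return group

# Two edges of a near object (disparity about 20) and one background edge.
left = [Edge(100, 50, 40, 90, 0.9), Edge(120, 60, 20, 0, 0.5),
        Edge(200, 80, 15, 45, 0.3)]
right = [Edge(80, 50, 40, 90, 0.9), Edge(99, 60, 20, 0, 0.5),
         Edge(198, 80, 15, 45, 0.3)]
group = select_figure(left, right)
print([(e.x, m.x) for e, m in group])
```

The background edge is dropped not because it fails to match, but because its match lies outside the narrow disparity band around the fixation; that is the sense in which the method selects figure from ground without computing depth.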

We have implemented an initial version of this algo- 
rithm, and used it in conjunction with an eye-head sys- 
tem, which can pan and tilt as a unit, as well as change 
the optic axes of one or both cameras. An example of 
this algorithm in operation is shown in Figures 6-11. 
Given the images in Figure 6, we extract edges (Figure 
7). From this set of edges, the most distinctive edge 
(measured as a combination of length and intensity con- 
trast) with a unique match is isolated in Figure 8. This 
enables the cameras to fixate the edge and obtain a new 
set of images (Figure 9) and edges (Figure 10). Relative 
to this fixation, stereo matching is performed over a nar- 
row range of disparity, isolating a set of edges likely to 
come from a single object (Figure 11). Notice how the 
tripod is extracted from the originally cluttered image, 
with minimal additional features. 

5 Conclusions 

We have suggested that stereo may play a central role 
in object recognition, but not in the manner usually as- 
sumed in the literature. We have suggested that stereo 
may be most useful in supporting figure/ground separa- 
tion, and that to do so it need not compute explicit 3D 
information. Supporting this argument were the obser- 
vation that depth reconstruction is extremely sensitive 
to accuracy in the measured camera parameters, and the 
observation that the human stereo system may not com- 
pute explicit depth. 

Using the idea of depth detectors tuned to a nar- 
row range about a fixation point has been previously 
explored in the literature, primarily for obstacle avoid- 
ance [15], [32]. This work considers the same general 
idea within the context of recognition. Such an approach 
opens up several other avenues for investigation. For example, 
what is the role of other visual cues in aiding the 
stereo matching problem? While one option is to augment 
image features with attributes, such as texture or 
color measures, an alternative is to consider using such 
cues to drive vergence eye movements, helping to align 
the cameras on trigger features, so that the local matcher 
can extract image features likely to correspond to a sin- 
gle object. We intend to explore these and related issues 
in the near future. 